??? (TODO), 2019

Today's schedule

TODO: nachträglich anpassen

  1. Introduction
  2. A short introduction to plotting with ggplot2
  3. Hello world: Plotting points on a world map
  4. Sources and file formats for geographic data

– lunch break –

  1. A short introduction to data linkage with dplyr
  2. Combining data and making a choropleth map
  3. Geo-coding and reverse geo-coding via Google Maps API
  4. Coordinate reference systems and projections

Introduction

About me

TODO

Aims of today's workshop

TODO

A short introduction to plotting with ggplot2

Plotting with ggplot2

There are three basic steps for constructing plots with ggplot2:

  1. Supply a data set you want to plot to ggplot().
  2. Define an aesthetic mapping with aes().
    This describes how variables of your data are mapped to visual properties, e.g. variable "age" is plotted on the x-axis and "income" on the y-axis.
  3. Add layers of geoms (geometrical objects) that describe which graphical primitives to use (e.g. points in a scatter plot or bars in a bar plot).

Additionally, you can further change the appearance of your plot by:

  • altering the scales (e.g. use a logarithmic scale, modify display of factors, etc.)
  • defining facets → create small multiples, each showing a different subset of the data
  • changing the overall appearance of the plot by adjusting its theme (e.g. change background color, rotate axis labels, etc.)

You combine all these steps with a +.

General concepts behind ggplot2

ggplot2 components - 1

General concepts behind ggplot2

ggplot2 components - 2

General concepts behind ggplot2

ggplot2 components - 3

General concepts behind ggplot2

ggplot2 components - 4

General concepts behind ggplot2

ggplot2 components - 5

General concepts behind ggplot2

ggplot2 components - 6

General concepts behind ggplot2

ggplot2 components - 7

Hello world: Plotting points on a world map

Coordinate reference systems in brief

We usually work with two-dimensional vector data: Points located within a coordinate reference system (CRS).

  • the point -73.94°, 40.6° is represented in the WGS84 CRS: most popular world coord. system (e.g. used by GPS)
  • this CRS is ellipsoidal; its components are given in degrees:
    • -73.94° is the longitude: about 74° West from the meridian
    • 40.6° is the latitude: about 41° North from the equator

Coordinate reference systems in brief

Let's put this point into context:

  • points can be connected to form other geometrical shapes such as lines or polygons
  • points and shapes can represent numerous geographic entities: cities, buildings, countries, rivers, lakes, etc.

Packages, packages, packages

We need to extend R in order to work with geographic data (geo-data) in R by installing these packages:

  • maps: contains geo-data for national borders and administrative regions for several countries
  • rgeos and maptools: Misc. functions for operations on geometries
  • sf: "Simple Features for R" – reading, writing and transforming spatial data

Install them if you haven't yet:

install.packages(c('maps', 'maptools', 'sf', 'rgeos'))

Simple features

Most packages for working with geo-data in R rely on the Simple Features standard (most prominently the sf package).

Simple features describe how objects in the real world can be represented in computers in terms of their spatial geometry:

Features have a geometry describing where on Earth the feature is located, and they have attributes, which describe other properties. – sf package vignette

Examples:

  • geometry: point at -73.94°, 40.6°
  • name: New York City
  • population: 8.6 mio.


  • geometry: polygons consisting of points at …
  • name: Italy
  • most popular meal: pizza

Making a world map

First, we load the packages that we need:

library(ggplot2)
library(maps)
library(sf)

Making a world map

The function map can be used to load the "world" data. We need to convert it to a "simple features" object via st_as_sf():

worldmap_data <- st_as_sf(map('world', plot = FALSE, fill = TRUE))
head(worldmap_data, n = 3)  # to have a look inside
## Simple feature collection with 3 features and 1 field
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -8.68335 ymin: 18.98662 xmax: 74.89131 ymax: 42.64795
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
##                         geometry          ID
## 1 MULTIPOLYGON (((74.89131 37... Afghanistan
## 2 MULTIPOLYGON (((20.06396 42...     Albania
## 3 MULTIPOLYGON (((8.576563 36...     Algeria
  • dataset with a "header" giving some general information
  • two columns: geometry (spatial data) and ID ("meta data")

What do the numbers in the geometry column mean? What's their CRS?

Making a world map

We are ready to plot the world map. Every "simple features" object can be plotted by adding a geom_sf layer:

ggplot() + geom_sf(data = worldmap_data)

Putting points on the map

We have some cities along with their longitude (lng) and latitude (lat):

##       name    lng    lat
## 1   Berlin  13.38  52.52
## 2 New York -73.94  40.60
## 3   Sydney 151.21 -33.87

We add a "point geom" layer to our map:

ggplot(some_cities) +                                # pass the cities data to plot
    geom_sf(data = worldmap_data) +                  # layer 1: world countries
    geom_point(aes(x = lng, y = lat), color = 'red') # layer 2: points at city coord.

Adding labels next to the points

You may also add text labels for the cities. ggplot2 provides geom_text and geom_label (draws box around text):

ggplot(some_cities) +                                  # pass the cities data to plot
    geom_sf(data = worldmap_data) +                    # layer 1: world countries
    geom_point(aes(x = lng, y = lat), color = 'red') + # layer 2: points
    geom_label(aes(x = lng, y = lat, label = name),    # layer 3: labels
               hjust = 0, vjust = 1, nudge_x = 3) +    # labels appear next to point
    theme(axis.title = element_blank(),
          axis.text = element_blank())                 # disable the axis labels

Excercise 1

TODO: Anpassen / 2 mgl. Übungen?

Visualizing WZB partner institutions on a world map

Make a script that:

  1. Load the CSV file wzb_partners_data.csv
  2. Load the data for the world map
  3. Remove the shape (i.e. the "feature") for Antarctica from the world map data
  4. Make a plot, showing the partner institutions with a dot on the world map (optional: make the color dependent on the "type" column)
  5. Optional: Group institutions by city and count institutions per city; show the city locations in Europe only as a dot, make the dot size dependent on number of institutions

See further hints on handout.

Sources and file formats for geo-data

Administrative levels and identifiers

Suppose you have observations that refer to geographic locations, e.g. countries, cities, neighborhoods. You may want to:

  • link that data with existing information on that geographic entity e.g. country GDP, city population, poverty rate in the neighboorhood
  • visualize that data geographically

For both things you need to:

  1. agree on geographic (administrative) level for matching (e.g. country, state, municipality)
  2. find identifiers to match the observations w/ respective geographic entities

TODO: schematische darstellung

Country names, country codes

  • different names may refer to the same country (e.g. Germany / Federal Republic of Germany, BRD, etc.) → often not a good identifier
  • ISO 3166-1 designates every country a two- or three-letter code (e.g. DE / DEU)
  • often used in datasets (e.g. from the UN)

Finding NUTS in the EU

The Nomenclature of Territorial Units for Statistics (NUTS) divides the EU territory into regions at 3 different levels for socio-economic analyses of the regions.

Candidate countries and countries inside EFTA also have NUTS regions (Norway, Switzerland, Albania, Turkey, etc.).

NUTS levels
source: Eurostat

Finding NUTS in the EU

AGS in Germany

In Germany, regions are hierarchically structured and identified by "Amtlicher Gemeindeschlüssel" (AGS ~ "municipality identificator"). The Gemeindeverzeichnis contains all regions with their name, AGS, area, population and other features.

Berlin: LOR codes

Since 2006, Berlin uses "Lebensweltlich orientierte Räume" (LOR) to structure the city area into sub-regions at different levels, each of them identified by "LOR" code.

  1. Prognoseräume (60 regions)
  2. Bezirksregionen (138 regions)
  3. Planungsräume (447 regions)

What if I have no geo-identifier?

If your observations don't contain the identifiers you need, you can still do the following as long as you have some information suitable for geo-coding (e.g. address, city name, institution name to convert to a geo-coordinate):

  1. find out the longitude/latitude coordinate for each observation
  2. match each coordinate to its surrounding geographic entity (i.e. a municipality) by doing a "point-within-polygon" test

Sources for geo-data

Have look in the handout document, where geo-data sources at world- and EU-level (NUTS) as well as sources for Germany (AGS) and Berlin (LOR) are presented.

File formats

There are numerous file formats that store data for geographic information system (GIS) software. The most popular include:

  • ESRI shapefile ("SHP"):
    • consists of at least three files in the same directory with filename extensions .shp, .shx and .dbf
  • GeoJSON (.geojson, .json) and TopoJSON (.topojson, .json)
  • KML (.kml, .kmz): Keyhole Markup Language, used by Google Earth
  • GML (.gml): Geography Markup Language
  • AutoCAD DXF (.dxf)

You may also encounter other formats such as vector data files (.svg, .poly) or databases (.sql, .sqlite).

Good news: sf can handle all popular file formats.

Exploring geo-data files with QGIS

QGIS is a free, open-source GIS software running on all major operating systems.

  • useful for exploring downloaded geo-data files
  • I will show how to load a geo-data file and investigate properties of individual shapes
  • if you've installed QGIS, you can follow along

QGIS screenshot

Lunch break

A short introduction to data linkage with dplyr

Data linkage with dplyr

Two datasets:

Particular matter: "pm" dataset
city pm_mg
Amman 999
Saltillo 869
Usak 814
The Bronx 284
Seoul 129
GPS coordinates of cities: "city_coords" dataset
city lng lat
Amman 31.91034 35.94846
Saltillo 1.00000 1.00000
Berlin 1.00000 1.00000
Usak 1.00000 1.00000
New York 1.00000 1.00000
Seoul 1.00000 1.00000

  • to combine these datasets, the "city" column might be used as identifier for matching
  • not all cities in "pm" also appear in "city_coords" and vice versa

Join operations

pm
city pm_mg
Amman 999
Saltillo 869
Usak 814
The Bronx 284
Seoul 129
city_coords
city lng lat
Amman 31.91034 35.94846
Saltillo 1.00000 1.00000
Berlin 1.00000 1.00000
Usak 1.00000 1.00000
New York 1.00000 1.00000
Seoul 1.00000 1.00000


Datasets can also be joined with merge, but I find dplyr easier to use.

left_join(a, b, by = <criterion>): always retains rows on the "left side" and fills up non-matching rows with NAs.

library(dplyr)
left_join(pm, city_coords, by = 'city')
##        city pm_mg      lng      lat
## 1     Amman   999 31.91034 35.94846
## 2  Saltillo   869  1.00000  1.00000
## 3      Usak   814  1.00000  1.00000
## 4 The Bronx   284       NA       NA
## 5     Seoul   129  1.00000  1.00000

Join operations

pm
city pm_mg
Amman 999
Saltillo 869
Usak 814
The Bronx 284
Seoul 129
city_coords
city lng lat
Amman 31.91034 35.94846
Saltillo 1.00000 1.00000
Berlin 1.00000 1.00000
Usak 1.00000 1.00000
New York 1.00000 1.00000
Seoul 1.00000 1.00000


right_join(a, b, by = <criterion>): always retains rows on the "right side" and fills up non-matching rows with NAs. How many rows do you expect for a right join between pm and city_coords? How many of them will contain an NA value?

right_join(pm, city_coords, by = 'city')
##       city pm_mg      lng      lat
## 1    Amman   999 31.91034 35.94846
## 2 Saltillo   869  1.00000  1.00000
## 3   Berlin    NA  1.00000  1.00000
## 4     Usak   814  1.00000  1.00000
## 5 New York    NA  1.00000  1.00000
## 6    Seoul   129  1.00000  1.00000

Join operations

pm
city pm_mg
Amman 999
Saltillo 869
Usak 814
The Bronx 284
Seoul 129
city_coords
city lng lat
Amman 31.91034 35.94846
Saltillo 1.00000 1.00000
Berlin 1.00000 1.00000
Usak 1.00000 1.00000
New York 1.00000 1.00000
Seoul 1.00000 1.00000


inner_join(a, b, by = <criterion>): only retains rows that match on both sides.

How many rows do you expect for an inner join between pm and city_coords?

inner_join(pm, city_coords, by = 'city')
##       city pm_mg      lng      lat
## 1    Amman   999 31.91034 35.94846
## 2 Saltillo   869  1.00000  1.00000
## 3     Usak   814  1.00000  1.00000
## 4    Seoul   129  1.00000  1.00000

Geographic coordinate systems and projections

Vector data

Coordinate reference systems and units

(norway <- worldmap_data[worldmap_data$ID == 'Norway',])
## Simple feature collection with 1 feature and 1 field
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -9.098877 ymin: 58.02095 xmax: 33.6293 ymax: 80.47783
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
##                           geometry     ID
## 166 MULTIPOLYGON (((5.08584 60.... Norway
ggplot() + geom_sf(data = norway)

Projections

Further reading